Characterizing behavioural differentiation in gene regulatory networks with representation graphs

Abstract We introduce the formal notion of representation graphs, encapsulating the state space structure of gene regulatory network models in a compact and concise form that highlights the most significant features of stable states and differentiation processes leading to distinct stability regions. The concept has been developed in the context of a hybrid system-based gene network modelling framework; however, we anticipate that it can also be adapted to other approaches of modelling gene networks in discrete terms. We describe a practical algorithm for representation graph computation as well as two case studies demonstrating their real-world application and utility. The first case study presents models for three phage viruses. It shows that the process of differentiation into lytic and lysogenic behavioural states for all these models is described by the same representation graph despite the distinctive underlying mechanisms for differentiation. The second case study shows the advantages of our approach for modelling the process of myeloid cell differentiation from a common progenitor into different cell types. Both case studies also demonstrate the potential of the representation graph approach for deriving and validating hypotheses about regulatory interactions that must be satisfied for biologically viable behaviours.


Introduction
Hybrid system-based modelling frameworks have proven to be useful tools for describing and analysing gene regulatory systems ( 1 ).They provide a natural mechanism for integrating the discrete and continuous aspects characterizing a given biological system.From the perspective of simulating biological behaviour, they have the same power as differential equationbased models and have been (sometimes implicitly) used for such a purpose.At the same time, the discrete aspects of hybrid models allow reasoning about all the possible dynamic behaviours of the system in a similar manner to Boolean models, although the state spaces of hybrid models are considerably more complicated to analyse.Some generalizations of the Boolean framework ( 2 ,3 ) can actually be regarded as implicit hybrid systems and share similar state space structures.
Previously we have presented a specific rigidly defined hybrid system-based modelling framework HSM ( 4 ).The framework has been successful for phage virus modelling, allowing the reproduction of biologically known lytic and lysogenic cycles accurately and to prove that under certain constraints these two behaviours are the only stable states allowed for the modelled systems ( 5 ).
This promotes the HSM framework as a good candidate approach for modelling various differentiation processes in gene regulation.The framework by itself, however, remains very versatile, and its main strengths emerge from the analysis methods applied to the models.In this study, we focus on analysis methods and introduce the notion of representation graphs in formal terms.We further demonstrate the usefulness of the representation graph-based approach through two case studies.
The first case study considers previously developed phage virus models and re-analyses the behavioural dynamics of these models in terms of representation graphs.The second case study presents an HSM model of myeloid cell differentiation, derived from a simpler Boolean model ( 6 ).It shows how its representation graph captures evolutionary states leading to cell differentiation and gene activities that trigger transitions between these states.
For both of these case studies, we also demonstrate how the representation graph approach can be used for deriving and validating hypotheses about regulatory interactions, which need to be satisfied in order for the models to describe biologically viable behaviours.

Modelling framework
For the description of gene regulatory networks (GRNs), we use a hybrid system-based HSM framework.Below we give a brief overview; for a detailed technical description, see ( 4 ).The definition of HSM is generally consistent with other widely accepted variations of hybrid system definitions, although it imposes some restrictions and simplifications-the goal is to keep the formalism as simple as possible whilst keeping it fully sufficient to describe the biological processes it intends to model.
HSM models are defined by tuples H = M, X, C, T, F, MF , where M = { μ 1 , …, μ k } is a set of modes; X = { x 1 , …, x m } is a set of continuous variables with real non-negative values; C = { c 1 , …, c r } is a set of real non-negative transition thresholds; T is a set of mode transitions in form τ = α → x λc β, where α, β ∈ M , x ∈ X , c ∈ C , λ ∈ { ≤, ≥}), F = { f 1 , …, f n } is a set of continuous and monotonous functions; and MF : M × X → F is a mapping providing mode-function assignments assigning to each mode α ∈ M and each variable x ∈ X a function g ∈ F .
Intuitively, modes M represent states of the uneventful evolution of biological systems during which no observable events occur.Variables X describe concentration levels of biological substances (such as proteins) and T describes allowed transitions between the modes triggered by substance concentrations reaching specific thresholds or dropping below them.Functions from F describe changes of substance concentrations with time.For each substance, these changes are mode specific, and the behaviour at each of the nodes is specified by the mode-function assignment .
A distinctive aspect of our modelling framework is that we assume that a modelled biological system is fully defined by a particular HSM model H, but our knowledge about the system is limited mostly by qualitative and not precise quantitative information (particularly regarding values of substance concentrations and their changes with time).Thus, essentially, we possess only some discrete knowledge about H, which does not identify the exact model but only some wider class of models to which H belongs.In precise terms, such classes are described by the notion of HSM frames F ( H ) , in which concrete functions from F are replaced by values from the set { ↑ , ↓ } (we assume that for a particular state we know whether substance concentrations are increasing or decreasing) and our discrete knowledge about the system is represented by constraints C ( T ) defined by a partial ordering of transitions in T .
Such an assumption appears to be well justified from the perspective of experimentally measurable data that we can obtain about a specific biological system, with measurements usually being limited to comparative changes in substance (e.g.RNA or protein) concentrations and not their exact values.Also, unless we are mainly focused on the simulation of the modelled system under specific conditions [e.g. as in a hybrid system model developed for cardiac cell gene regulation ( 7 )], there are few options apart from introducing and examining some type of discrete approximations of continuous state spaces inherent in hybrid system models.
Several techniques have been considered for building such discrete approximations, ranging from pre-defined partitioning of state spaces into regions representing different gene activity levels [thus, the state space of an n gene network is regarded as an n -dimensional hypercube ( 8 ,9 )] to gradual stepwise discretization by introducing additional state space subregions during the process of backward reachability analysis from some final steady state ( 10 ).
To an extent, we adopt elements from both of these approaches.The discrete structure of frames F ( H ) is fully specified by the initial HSM, in which functions from F are treated as unknown, apart from the fact of whether they are increasing or decreasing, i.e. they could be replaced by either ↑ or ↓ .The biological system is described by a class of models with the initial set of constraints C incorporating the available biological knowledge.During the analysis phase, this class of models is gradually partitioned into smaller subclasses that represent observationally distinguishable behaviours of the modelled system.

Describing gene regulatory networks with models
The HSM framework by itself does not directly associate any biological context with modes from the set M or with other model parameters; therefore, for useful biological mod-els, such associations should be assigned by their design.The most natural (and effectively the only known) approach is to relate them in some way to gene activity.For example, in Boolean models, the states are directly defined by gene activity levels, assuming two such levels for each gene (either 'active' or 'inactive').In multivalued models, this assumption is further extended by allowing each of the genes to be in one of more than two discrete states ( 11 ).
A distinctive feature of our approach to model development is that we define the system's HSM modes ('states') primarily by the occupancies of gene-binding sites rather than gene activity levels themselves.Correspondingly, the set of variables X represents proteins (products of genes included in the model), with 0 or more binding sites for each x ∈ X .This provides an important possibility to account for (biologically inherent) time delays for gene interactions-these delays fully depend on functions from F governing the substance change rates, although our knowledge about them remains limited by discrete constraints from C ( T ).Such a separation of gene and binding site activities has been introduced for finite state linear models (FSLMs) ( 12 ) and has been the key feature contributing to the modelling power that FSLM formalism provides.With the HSM framework having gradually evolved as a natural generalization of FSLMs, this also remains a key feature for the developed HSM models.
In line with the approach of model design focused on binding site activities, each binding site B [ x ] is assigned two real valued association and dissociation thresholds a ).A binding site-specific protein occupies the site B [ x ] when its concentration reaches a ( B [ x ]); the site then becomes active.When the concentration of that protein drops below d ( B [ x ]), the site becomes free and inactive.Using distinct binding site thresholds d are assumed to be too short to consider their partial overlaps.For two sites B 1 [ x ] and B 2 [ x ] binding the same protein, we assume that their affinities are either the same, ( ), or one of them is strictly higher ( B 1 [ x ] < B 2 [ x ])-such a type of constraint can be included in the initial model from the known experimental evidence or derived at the stage of model analysis.In most cases, each binding site will bind a specific protein.Still, occasionally it may be convenient to consider binding sites that can bind several competing binding factors-notably this is the case for phage virus models.
The set of modes M can be fully defined by binding site activities and the relationship of the current values of X variables to binding affinity constraints.Sets of modes with different relationships of X values to affinity constraints might be indistinguishable and can be merged, thus significantly reducing complexity.Two different modes, however, will necessarily have different binding site activity values.The set of transitions T is defined by Boolean functions, with their values dependent on binding site activity states.Each τ = α → x λc β ∈ T switches activity of one particular gene, changing MF ( x ) from ↑ to ↓ , or vice versa.For each α ∈ M and each x ∈ X , there is at most one transition from α labelled by x .

Analysis of model dynamics
The dynamics of model F ( H ) are described by its state space graph.Let F ( H ) = M, X, C, T, F, MF .Its state space graph is an edge-labelled directed graph G = (M, ˆ T ) with the set of vertices To characterize all the possible dynamic behaviours the modelled system can exhibit, we are mostly interested in the following two properties of its state space G .

Regions of stable behaviour and attractors
Intuitively, stable behaviour regions should be characterized by sets of states A ⊆M such that, once any state of m ∈ A is reached, the system will never reach any state outside A , and all the states from A will be revisited infinitely often.For Boolean models, such regions are known as attractors and are simple cycles.
In more complex state spaces, providing a precise definition for attractors that match these intuitive requirements becomes considerably more difficult.In the HSM framework, we define attractors as strongly connected components (SCCs), for which, additionally, the model does not imply that any trajectories entering these components will eventually leave them.Similar definitions of attractors have been proposed previously for several types of generalized Boolean models ( 3 ,13 ).For hybrid models, however, the state reachability problem is known to be algorithmically undecidable.Due to this, it cannot be guaranteed that two different stable behaviour regions will be separated into different attractors.Nevertheless, such a definition seems to work well for the models that we have analysed.

Definition
By A ( G ) we denote the set of all attractors in state space G .

Switching states and switching regions
We call m ∈ M a switching state if there is a transition (m, m ) ∈ ˆ T , such that fewer attractors can be reached from m than from m .Typical model spaces can contain large numbers of switching states forming long and complex cascades, therefore it is useful to partition them into equivalence classes.Accordingly, we define switching regions as the maximum sets containing switching states of the same type.

Definition
D ⊂M is a switching region if there exists a subset of attractors S ⊆A ( G ), distinct proper subsets S 1 , …, S k ⊂S and distinct labels x 1 , …, x k ∈ X , such that: (i) for all m ∈ D , the set of all attractors reachable from m is S; (ii) for all m ∈ D and i = 1…k , there is a transition from m to m with l (( m , m )) = x i and S i being the set of all attractors reachable from m ; and (iii) D is the maximum set satisfying properties (i) and (ii).
By D ( G ) we denote the set of all switching regions in state space G .

Representation graphs
As a convenient and concise way to describe the most informative features of model state spaces, we propose the formal notion of representation graphs.There are two motivating reasons for introducing them.Firstly, all the stable behaviours of the modelled system and the differentiation processes leading to them are largely characterized by the model's state space attractors and switching regions; the internal structure of these regions is less significant.Secondly, an important observation from the models we have explored was that even very large state spaces (with tens of thousands of nodes) can contain only a few attractors and switching regions.Moreover, similar but different models tend to have the same sets of attractors and switching regions as well as transitions between them, although their state spaces are very different.
In representation graphs, each attractor A ( G ) and each switching region R ( G ) of a state space G = (M, ˆ T ) corresponds to a single vertex.In addition, we are interested in the dependence of attractor and switching state reachability on initial states.To characterize this, we consider a third type of regions, i.e. pool regions in state spaces, and assign a single vertex for each such region in R ( G ).
We say that m ∈ M is a pool state if m ∈ U for any U ∈ A ( G ) ∪ D ( G ) (i.e.m is neither an attractor nor switching state) and m is not reachable from any switching state m (i.e.m does not lie on any path from a switching state to an attractor or another switching state).
such that: (i) for all m ∈ P , the set of all attractors and switching regions reachable from m is S ; (ii) for all m ∈ P , U ∈ S and an edge ( m , m ) on a path from m to U with m ∈ U and m ∈ U : l (( m , m )) = l ( U ); and (iii) P is the maximum set satisfying properties (i) and (ii).
The single element set { l ( U )} containing a label that defines reachability to U ∈ S we denote by l ( P , U ).By P ( G ),we denote the set of all pool regions in state space G .
According to these definitions, there can be no outgoing paths from attractors and no incoming paths to pool regions in G .Paths from switching regions to attractors and other switching regions are possible.For and ( m 1 , m 2 ) is an edge on a path from U 1 to U 2 }.If U 2 is an attractor reachable from U 1 , then l ( U 1 , U 2 ) will be a single element set.
Any node m ∈ M of state space G can belong to either an attractor, a switching region or a pool region, in which case m ∈ U for some Alternatively, a node can belong to neither of these three types of regions, in which case it lies on a path from a switching region to an attractor or another switching region.Due to a lack of a specific role in the modelled system's dynamics, we treat such nodes as 'non-interesting' and do not include them directly in representation graphs.
Depending on a model, we might be interested only in the region of the state space reachable from a specific initial state (e.g. from 0-state with all binding sites being vacant), or the model by its design could be appropriate only for the description of the system dynamics starting from some particular state m .For example, the myeloid differentiation model described in our case study is designed to model only the differentiation phase after progenitor state CMP has already been reached.In such cases, the representation graph R ( G ) can be constructed for the induced subgraph G with the vertex set of G restricted to those that are reachable from the node m .
Below we present an efficient and practical Algorithm 1 for representation graph computation.For a given HSM model frame's state space G = (M, ˆ T ) , the algorithm computes its representation graph R ( G ) = ( V , E ) and its set of attractors A ( G ).The internal structure of attractors (unless they represent single steady states) is of notable interest by itself, and for viable models it could be expected that all the attractors from A ( G ) can be assigned biologically feasible interpretations.
The algorithm starts by explicitly computing the set A ( G ) (finding SCCs is a well-known problem solvable in linear time).The computation of switching regions D ( G ) and pool regions P ( G ) is then done implicitly by performing vertex pair contractions in a copy of the input graph.Vertex contraction is performed by the function ) into a single vertex v 2 and merges sets of incoming and outgoing edges from v 1 and v 2 .Vertex label l ( v 2 ) is replaced by l ( v 1 ) ∪ l ( v 2 ) and, for each edge e resulting from merging two edges e 1 and e 2 , a new label l ( e ) = l ( e 1 ) ∪ l ( e 2 ) is assigned.Initially assigned labels are single-valued; for notational simplicity, such single-valued l we also treat as one-element sets { l }.
Function Join ( S ) can also be applied to any non-empty subset of vertices S ⊆V by iteratively calling The algorithm is implemented in Python and is freely available online in GitHub and Zenodo repositories.The implementation relies on NetworkX ( 14 ) and N2G ( 15 ) libraries.The latter is used for an optional generation of representation attractor graph visualizations in GraphML format, which can be viewed by, e.g.yEd Graph Editor ( 16 ).Sample visualizations are shown in Figure 1 for two non-equivalent HSM frames of the lambda phage model.The representation graphs of both frames are identical (with pool states ignored).However, their attractor regions differ, with only a 16 state lysis attractor (Figure 1A), but not an 8 state lysis attractor (Figure 1B) being biologically feasible.

Workflow of model design and analysis
A typical development of a HSM model starts by defining a model H with mode and transition sets built in accordance with a known gene regulatory mechanism and incorporating additional known biological assumptions as constraints on transition ordering.Such an initial model often has a trivial representation graph with few and non-meaningful attractors.The partial constraints placed in the model, however, can serve as a seed that can be complemented by additional hypothetical constraints.This provides a way to define and test many hypothetical models described by frames F 1 , . . ., F N , for which state space graphs G i and the corresponding representation graphs R i with their sets of attractors A i are then being computed.A set of pairs ( R i , A i ) is then further partitioned into equivalence classes, with ( R i , A i ) and ( R j , A j ) defined to be equivalent if R i ∼ = R j and A i ∼ = A j (i.e. are isomorphic) as vertexand edge-labelled graphs.For representation graphs, vertex labels distinguish between attractor, switching and pool states (and, depending on the focus of interest, pool states could be ignored).For attractors A i , their vertex labels are defined by occupancy states of binding sites.This analysis workflow is schematically shown in Figure 2 .

Results
We present two case studies that demonstrate the benefits of using representation graphs for analysing and understanding the different behaviours that biological systems described by HSM models can exhibit.

Phage virus models
The first case study covers three bacteriophage virus models (Figure 3 ), lambda phage (model LPH2), HK022 phage and  Mu phage.Detailed descriptions of these models and their design considerations are discussed in ( 17 ).
The development of phage models was driven by the availability of a very comprehensive biological description of gene regulatory processes of lambda phage by McAdams and Shapiro ( 18 ).This semi-formal description was used by Brazma and Schlitt ( 12 ,19 ) as a basis for proposing already a strictly formalized lambda phage model described by a FSLM framework.A notable achievement of the model and the proposed FSLM framework was the possibility of obtaining very realistic simulations of the two known characteristic behaviours of lambda phages: lytic and lysogenic cycles.The modelling power needed to accomplish this largely stems from introduction of explicit separation between genes and transcription factor-binding sites and their activities.A limiting feature of FSLM, however, is the assumption that protein concentration changes are described by linear functions, which could be considered an oversimplification from a biological perspective.
For the particular lambda phage model, however, it was also noted that the exact forms of functions governing protein concentration changes have limited importance.At the same time, the monotonicity of these concentration change functions is what matters for simulated behaviours to be consistent with biological observations.This observation was one of the driving factors for developing HSM as a more general modelling framework, which allows much greater flexibility for the choice of concentration change functions and also inherits a convenient way of redefining FSLM models in terms of the HSM framework.Analysis of the LPH1 lambda phage model thus obtained showed that lysis and lysogeny are the only stable behaviours possible according to the model.It also allowed the formulation of several hypotheses about comparative levels of binding site affinities that must hold for the model to describe biologically feasible lytic and lysogenic behaviours ( 5 ).
Here we present a revised and slightly more complex LPH2 model (Figure 3 A) of lambda phage virus, which incorporates more recent biological knowledge ( 20 ).In particular, it introduces an extra binding site for one of the proteins and proposes somewhat different regulation mechanisms for a couple of the involved genes.The model contains 11 proteins and their corresponding genes: cI, cII, cIII, cro, int, N, O, P, Q, xis and Struc (denoting a set of co-regulated structural proteins), and 11 binding sites: bcII-1, bcII-2, bcII-3 (binding cII), bN (binding N), bQ (binding Q), and operator OR and OL binding sites bOR1, bOR2, bOR3, bOL1, bOL2 and bOL3 (each of these sites is competitively binding proteins cI and cro).The dependence of gene activity states from binding site occupancies is defined by the following Boolean formulae (with bS being true for an occupied binding site, and bS[P] being true for binding site occupied specifically by protein P): In terms of the HSM framework, LPH2 is described by a frame with the set of modes M almost being defined by all different occupancy states of the binding sites (up to 23 328 modes, the exact number depends on binding site affinity constraints), and with each variable from the set X representing one of 11 proteins.The Boolean formulae of gene activities define a set of transitions T and mode-function assignments MF .
The biologically known constraints of binding site affinities for the LPH2 model are the following: bcII-2 < bcII- with comparative affinities between bOR and bOL operator sites and for the bcII-3 site being unknown.These constraints lead to 1200 dif-ferent hypothetical HSM frames to be considered during the model analysis phase.
The HK022 model (Figure 3 A) is based on general biological knowledge ( 21 ,22 ) and is broadly similar to LPH2 but has a slightly different regulatory mechanism.Notably, however, HK022 does not have antitermination gene N , and the lack of it has been considered as a difficulty to fully understand how its regulatory mechanism behaves.The model includes 10 proteins and their corresponding genes: cI, cII, cIII, cro, int, O, P, Q, xis and Struc, and nine binding sites: bcII-1, bcII-2, bQ, and operator OR and OL binding sites bOR1, bOR2, bOR3, bOL1, bOL2 and bOL3.The gene activities are defined by the following Boolean formulae: The known constraints of HK022 binding site affinities are the same as those of the lambda phage LPH2 model, which leaves for analysis 400 different hypothetical HSM frames with up to 5832 modes.
The Mu model (Figure 3 B) is based on the biological description of its regulatory network by Paolozzi and Ghelardini ( 23 ).The virus exhibits a similar type of lytic and lysogenic behaviour to lambdoid phages; however, the genes and regulatory mechanisms involved are very different.The model includes five proteins and their corresponding genes: c, ner, SEE, trsp and AB (denoting two co-regulated and co-binding proteins A and B), and five binding sites: bNL, bNR (binding ner), bO2, bO3 (binding c) and bIAS (binding AB).The gene activities are defined as follows: The known binding site affinity constraint is bO2 < bO3; comparative affinities of bNL and bNR sites are unknown.This gives for analysis two HSM frames with up to 32 modes in each.Comprehensive analysis of the individual models is presented in ( 17 ), but here we would like to highlight the fact that, apart from labels that correspond to gene names, the representation graphs for all three phage virus models and all the frames defining them are the same.The phage representation graph is shown in Figure 4 .All models have exactly two attractors that can be associated with lysis and lysogeny and exactly two switching regions: A and B. The attractor reachability depends on comparative activity levels of two genes g 1 and g 2 .The genes g 1 and g 2 , whose activity irrevocably switches the models to lytic and lysogenic behaviours, are cI and cro for LPH2 and HK022, and c and ner for the Mu model.
Although all 1200 LPH2 frames share the same representation graph shown in Figure 4 , they partition into six equivalence classes with the different topologies of lysis and lysogenic attractors, and only two of these classes (one including 380 and the other 190 frames) can be considered biologically feasible.The inclusion in one of these two classes requires an additional affinity constraint bOR3[cI] < bOL1[cI] for valid lysogeny behaviour, and a bOR2[cro] < bOL2[cro] constraint for valid lysis behaviour.These affinity constraints can be viewed as hypotheses predicted by the model and, in principle, could be validated experimentally, although such validation remains complicated.In both of these two classes, lysogeny attractor regions contain two states, and lysis attractor regions contain 16 states, with transitions in the lysis attractor region being different for each class.The exact type of lysis attractor depends on affinity constraints on cII binding, with the constraint bcII-2 < bcII-3 < bcII-1 leading to one of these lysis types and both alternative constraints bcII-3 < bcII-2 < bcII-1 and bcII-2 < bcII-1 < bcII-3 leading to another.
The default assumption for phage virus models is that, after the initial infection of bacteria, all binding sites should be vacant (with virus proteins not yet present).Thus, of clear relevance are model state space regions reachable from the initial 0-state.In representation graphs of lambdoid phage models, these state space parts are covered by the switching region A (Figure 4 ) and two attractor states, with an irrevocable transition to either lysis and lysogeny behaviours depending on competitive growth rates of cI (g 1 ) and cro (g 2 ) proteins.
However, a topic of increasing research interest is also infection of a single cell by multiple phage viruses (24)(25)(26), in which case reachability from other states and not only the 0-state can become relevant, i.e. the switching region B and the pool region U could also have active biological roles.In principle, the complete state spaces and their representation graphs of our LPH2 model are expected to fully character-ize all the possible behaviours in such multiple infection scenarios (thus, lysis-lysogeny switching could also occur while concentrations of cI and cro decrease).Admittedly, the current model might not fully cover biological interactions that become significant when multiple copies of lambda phage are present within a single cell.
Similarly, all 400 HK022 frames share the same representation graph of Figure 4 .All HK022 frames are contained in a single equivalence class with two attractors that can be well associated with feasible lysis (described by a two state attractor) and lysogeny (described by a 12 state attractor) behaviours ( 17 ).Such a result, whereby all HK022 model frames behave in a very similar and biologically feasible way, independently of additional constraints, is somewhat unexpected, given that the lack of N protein is considered problematic for the model's stability.
For the Mu model, we have two initial frames, from which only the frame satisfying affinity constraint bNL < bNR has two attractor regions that can be well associated with lysis (two state attractor) and lysogeny (four state attractor).The representation graph again is the same as shown in Figure 4 , although in this case proteins triggering the lysis-lysogeny switch are completely different-c (g 1 ) and ner (g 2 ) ( 17 ).The alternative affinity constraint bNL > bNR leads to a state space with a single attractor without lysis-lysogeny differentiation, thus the constraint bNL < bNR is a prediction obtained from the model, which could be validated experimentally.

Myeloid progenitor differentiation model
The second case study describes myeloid cell differentiation in a mouse (used as a model organism due to data availability, with expectations that the model can be adapted and transferred to other mammalian organisms, including humans).It is based on a very detailed Boolean model ( 6) describing a treelike evolutionary process of divergence of common myeloid progenitors (CMPs) into granulocyte-monocyte progenitors (GMPs) and megakaryocyte-erythrocyte progenitors (MEPs) that further evolve into granulocytes, monocytes, megakaryocytes and erythrocytes.The model involves 11 genes, and the authors provide a very detailed assessment of a particular choice of regulatory functions with varying degrees of underlying evidence for the included regulatory interactions and functions-ranging from experimentally confirmed regulatory interactions to co-related gene expression and some hypothetical interactions.The dynamics analysis assumed a fully asynchronous model behaviour, and the authors have identified the main genes contributing to specific evolutionary choices.However, the paper does not provide enough details for the full reproducibility of the results.
Our base HSM model of myeloid cell differentiation (Figure 3 C) has been derived from the Boolean model ( 6) and contains 11 proteins and their corresponding genes: GA T A-1, GA T A-2, FOG-1, EKLF, Fli-1, SCL, C / EBP α, Pu.1, cJun, EgrNab and Gfi-1.The total number of the involved binding sites is 30 (including some that are not confirmed experimentally but implied from regulatory assumptions), with multiple binding sites for most of the proteins.However, due to a lack of evidence that affinities of different sites binding the same gene might differ, for modelling purposes we include only a single binding site bG for each gene G.The gene activities are defined by the same formulae as in the Boolean model ( 6 ), substituting for function arguments the activity states of the corresponding gene-binding sites: The model describes the differentiation of CMPs into granulocyte, monocyte, megakaryocyte and erythrocyte cell types.Reaching steady states corresponding to these cell types is fully determined by activation, correspondingly, of one of the genes Gfi-1, EgrNab , Fli -1 and ELKF .To ensure the stability of these steady states, the model includes explicit inhibitory regulations between Gfi-1, EgrNab and Fli -1, ELKF gene pairs, although experimental evidence supporting these inhibitory regulations is partial and indirect.In particular, it has been noted that direct inhibition of EgrNab by Gfi-1 might not be strictly necessary ( 6 ), and our HSM model is well suited for testing such a hypothesis.We do this by considering a model variant where this inhibitory regulation is absent, and EgrNab regulation is given by the formula: (Similarly, we have tested the need for direct inhibitions of Gfi-1, Fli -1 and ELKF genes; however, these were shown to be necessary for differentiation, according to the model, to occur.) The representation graph of the base model is shown in Figure 5 A. By the model design, all attractors are intended to be single-state regions, thus of primary interest are switching states and trajectories from the CMP state leading to four attractor regions.The model adequately shows the evolution of CMPs into four different cell types and confirms the presence of a well-defined intermediate MEP region (region B).There is no clearly defined GMP region, however, which remains divided into three parts: A, C and D, from which only A can be viewed as a GMP, while from C and D regions there remain evolution paths to the MEP region.This seems consistent with some recent work ( 27 ) indicating that the absence of a clearly defined GMP region might align with experimental observations.However, the competitive concentration drops of GA T A-1 and PU.1 are decisive for the differentiation of CMP into granulocyte-monocyte and megakaryocyte-erythrocyte subtypes.
The representation graph of the model variant lacking EgrNab inhibition by Gfi-1 is shown in Figure 5 B. It confirms the hypothesis that direct EgrNab inhibition might not be necessary for differentiation into stable granulocyte and monocyte cells, although differentiation paths to these states from CMPs become more complex.Parts of the base model representation graph corresponding to CMP and A switching states in the model variant are replaced by two six state cycles S 1 , …S 6 and S 1 , . . .S 6 with identical regions reachable from all the states of these cycles, albeit with different transition conditions to them.The biological accuracy of such cyclic switching trajectories is disputable; however, their presence relies on a number of other unverified assumptions incorporated into the model, and the fact that direct EgrNab inhibition might be unnecessary appears to be generally confirmed by the model.
It should be noted that the Boolean model ( 6), as well as our HSM model derived from it, are explicitly designed to describe only the differentiation phase that starts from the CMP state already being reached and are not intended to describe biological processes leading to CMP.Accordingly, the representation graphs in Figure 5 cover only parts of state spaces reachable from CMP as the initial state.
It might seem tempting to introduce multiple binding sites for some of the genes in our myeloid cell differentiation model (especially where there is good experimental evidence for their existence); however, there is a lack of objective biological evidence to justify this.Moreover, in comparison with virus models, the regulatory network is much more densely interconnected, and most of the genes are directly involved in regulatory activities; it can be checked that by introducing binding affinity constraints only as 'hypotheses for verification' one can quite easily custom design models that have the desired behaviours but lack biological relevance.More generally, this also highlights the differences between using a HSM-based approach for the modelling of small and compact virus gene networks, when we can aim for the development of a model for the entire 'life cycle' of a particular biological process, in comparison with modelling much larger strongly interlinked eukaryotic gene networks, where we are confined to the development of models for only a small part of a biological system.

Discussion
The main goals of the study were the introduction in formalized and well-defined terms of the concept of representation graphs, and the demonstration of the utility of this formalized notion for analysing behaviour dynamics of GRN models.The two presented case studies show the particular suitability of representation graphs for the characterization of processes that lead to irrevocable differentiation of the modelled system's behaviour into several distinct regions of stability.
The first of the case studies is based on our previous work on phage virus modelling.The introduction of the concept of representation graphs has been strongly influenced by informal-style visualizations that we have used to illustrate the phage models' behaviours.However, the definition of this notion in a formal and rigid manner allows these results to be captured in a compact and uniform way that provides a precise understanding of the evolutionary dynamics the models allow.In particular, the realization that all viable phage virus models share the same representation graph describing their state space structures was unanticipated before these compact representations were obtained.
The second case study proposes a redefinition of the known Boolean model of myeloid cell differentiation in terms of the HSM framework.Besides showing the advantages of this framework for model description, it very well demonstrates the merits of representation graphs for providing a detailed understanding of differentiation processes that govern the evolution of a progenitor state cell into one of the four mature cell subtypes.Both of these case studies also show the capacity of the modelling framework to suggest and validate new hypotheses about regulatory interactions.
The presented algorithm for computing representation graphs from state spaces of the modelled systems essentially The model still provides complete separation between steady states corresponding to granulocyte and monocyte cell types, although differentiation paths to these states from CMPs become more complex.Single switching states CMP and A are replaced by six state cycles; transition to and from dotted rectangles containing these cycles apply to the all states contained in them.Notably, both models ha v e regions (B) that can be well associated with MEP states, but there are no regions that unambiguously can be associated with GMP states.
shows that their use for analysing the models' behaviour is practical as long as the models' state spaces can be computed efficiently.The size of state spaces grows exponentially with n (in the case of HSM, n denotes the number of the modelled protein-binding sites).This limits the upper size of the models to which HSM framework-and / or representation graphbased analysis can be effectively applied to n ∼ 30; the value is similar to the upper size of Boolean GRN models that can still be fully analysed.
By their definition, the application of representation graphbased behaviour analysis is meaningful only for models that describe some process of differentiation in which the system can evolve to more than one stable state; otherwise their representation graphs are trivial.Admittedly, the number of known or published GRN models (including those presented in not particularly rigid terms) of such a type remains rather limited.Nevertheless, for models describing some kind of differentiation process, the representation graphs will provide answers to the most natural questions about the model's behaviour that one might ask.As far as we are aware, apart from the case of simple Boolean GRN models, no comparable formalized methods for comprehensive analysis of the structure of state spaces of the modelled biological systems have been proposed before.
As they are used here, the representation graphs are specific to a particular hybrid system-based modelling framework.However, as the example of myeloid cell differentiation models shows, there appears to be a good potential to adapt them to other GRN modelling approaches in which the networks can be described in terms of discrete state spaces.
then it is deleted.The function Reach ( S ) returns the union of the sets of vertices reachable from v ∈ S .The algorithm initially computes SCCs and their reachability from other vertices in (| V | + | E |) steps.Each of the remaining steps reduces the number of vertices by 1, thus there are at most ( O | G |) of them, and, by performing Join operations in appropriately chosen order, this part of the algorithm can also be implemented in O (| V | + | E |) steps.To the algorithm's time complexity, however, the sizes of edges labels | l ( e )| and the sizes of sets of reachable vertices | Reach ( D i )| also contribute.| l ( e )| is bounded by the number of genes n ; the upper bound for the number of genes n in models for which the HSM framework could be applied is ∼30, and thus can be treated as constant.| Reach ( D i )|, however, is bounded by O (| A ( G )|) = O (| V |), and it is not difficult to construct artificial models with | Reach ( D i )| = (| V |), leading to O (| V | 2 + | E |) worst case time complexity.For useful models, however, one should expect only a small number of attractors, in which case the time complexity becomes bounded by O (| V | + | E |).Thus, representation graphs can be effectively constructed for all HSM models for which state space graphs G = (M, ˆ T ) themselves are computable.

Figure 1 .
Figure 1.Representation graphs and the str uct ure of lysis and lysogeny attractors for two lambda phage models.The visualizations are created with yEd diagram editor from GraphML output files by manually adjusting the layout and grouping vertices.In representation graphs (left), switching regions are sho wn b y rhombi and attractors b y coloured rectangles; pool regions are depicted b y gre y rectangles.T he attractor graphs (right) sho w the corresponding parts of state spaces that are contracted in single attractor nodes in representation graphs.Only for model ( A ), lysis attractor corresponds to biologically feasible behaviour, whilst for model ( B ) it does not.Both models, however, have the same representation graph str uct ure.

Figure 2 .
Figure 2. Workflow of HSM model analysis.A model is described by N state spaces, G i corresponding to its HSM frames.From these we compute representation graphs RG i and their attractor set AG i .Representation graphs are then further partitioned into equivalence classes defined by isomorphism.

Figure 3 .
Figure 3. Schematic networks of gene interactions for lambda and HK022 phage ( A ), Mu phage ( B ) and myeloid cell differentiation ( C ) models.Sites binding two different transcription factors and sites binding the same transcription factor with opposite regulatory impacts on the regulated gene(s) are e xplicitly sho wn; otherwise, the presence of a single binding site is implicitly assumed f or each regulatory interaction.Lambda and HK022 phage models (A) differ by regulation of gene Q , in lambda phage Q being regulated by N and cII (as shown with dashed lines), and in HK022 phage being regulated by cro and cI (as shown with dotted lines); in HK022 phage, gene N is absent.For the myeloid differentiation model ( C ), reaching one of the four steady states is fully determined by activation of one of the genes Fli -1, ELKF , Gfi-1 or EgrNab .In a model variant, direct inhibition of EgrNab by Gfi-1 is absent.

Figure 4 .
Figure 4. Representation graph describing state spaces of all phage virus models (lambda, HK022 and Mu phages).All models ha v e e xactly tw o attractors that can be associated with lysis and lysogeny and exactly two switching regions: A and B.The attractor reachability depends on comparative activity levels of two genes g 1 and g 2 (the in v olv ed genes differ for lambdoid and Mu phages).Depending on the model, the pool regions of non-switching states S and U could also be empty.

Figure 5 .
Figure 5. Representation graphs for the myeloid cell differentiation model.( A ) Base model with explicit inhibitory gene interactions preventing transitions between granulocyte-monocyte and megakar yocyte-er ythrocyte state pairs as soon as one of these steady states has been reached.( B ) A model variant without explicit inhibition of the EgrNab gene by Gfi-1 .The model still provides complete separation between steady states corresponding to granulocyte and monocyte cell types, although differentiation paths to these states from CMPs become more complex.Single switching states CMP and A are replaced by six state cycles; transition to and from dotted rectangles containing these cycles apply to the all states contained in them.Notably, both models ha v e regions (B) that can be well associated with MEP states, but there are no regions that unambiguously can be associated with GMP states.